The emerging research area of visual task planning attempts to learn representations suitable for planning directly from visual inputs, alleviating the need for accurate geometric models. Current methods commonly assume that similar visual observations correspond to similar states in the task planning space. However, sensor observations are often noisy and contain several factors that do not alter the underlying state, for example, different lighting conditions, different viewpoints, or irrelevant background objects. These variations result in visually dissimilar images that correspond to the same task state. Achieving robust abstract state representations for real-world tasks is an important research area that the ELPIS lab is focusing on.
This work presents a motion planning framework for robotic manipulators that computes collision-free paths directly in image space. The generated paths can then be tracked using vision-based control, eliminating the need for an explicit robot model or proprioceptive sensing. At the core of our approach is the construction of a roadmap entirely in image space. To achieve this, we explicitly define sampling, nearest-neighbor selection, and collision checking based on visual features rather than geometric models. We first collect a set of image-space samples by moving the robot within its workspace, capturing keypoints along its body at different configurations. These samples serve as nodes in the roadmap, which we construct using either learned or predefined distance metrics. At runtime, the roadmap generates collision-free paths directly in image space, removing the need for a robot model or joint encoders. We validate our approach through an experimental study in which a robotic arm follows planned paths using an adaptive vision-based control scheme to avoid obstacles. The results show that paths generated with the learned-distance roadmap achieved 100% success in control convergence, whereas the roadmap based on the predefined image-space distance enabled faster transient responses but had a lower convergence success rate.
@article{chatterjee2025-image-space-roadmaps,title={Image-Based Roadmaps for Vision-Only Planning and Control of Robotic Manipulators},journal={IEEE Robotics and Automation Letters},publisher={Institute of Electrical and Electronics Engineers (IEEE)},author={Chatterjee, Sreejani and Gandhi, Abhinav and Calli, Berk and Chamzas, Constantinos},year={2025},month=jul}
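To illustrate the kind of construction the abstract above describes, the following is a minimal, hypothetical sketch of a PRM-style roadmap built purely from image-space keypoints. It assumes each sample is an (N, 2) array of keypoint pixels and that obstacles are given as a binary image mask; the distance metric and collision check are illustrative stand-ins, not the paper's implementation.

# Hypothetical image-space roadmap construction (illustrative only).
import numpy as np
import networkx as nx

def keypoint_distance(kp_a, kp_b):
    # Predefined image-space metric: mean pixel distance between corresponding keypoints.
    return float(np.linalg.norm(kp_a - kp_b, axis=1).mean())

def edge_is_free(kp_a, kp_b, obstacle_mask, steps=10):
    # Stand-in visual collision check: interpolate keypoints in image space and
    # reject edges whose interpolated keypoints land on obstacle pixels.
    for t in np.linspace(0.0, 1.0, steps):
        kp = (1.0 - t) * kp_a + t * kp_b
        rows = kp[:, 1].astype(int)
        cols = kp[:, 0].astype(int)
        if obstacle_mask[rows, cols].any():
            return False
    return True

def build_image_space_roadmap(samples, obstacle_mask, k=5, dist=keypoint_distance):
    # samples: list of (N, 2) keypoint arrays captured at different robot configurations.
    roadmap = nx.Graph()
    roadmap.add_nodes_from(range(len(samples)))
    for i, kp_i in enumerate(samples):
        neighbors = sorted((dist(kp_i, kp_j), j) for j, kp_j in enumerate(samples) if j != i)
        for d, j in neighbors[:k]:
            if edge_is_free(kp_i, samples[j], obstacle_mask):
                roadmap.add_edge(i, j, weight=d)
    return roadmap

A query then connects the start and goal keypoint observations to their nearest roadmap nodes and runs a shortest-path search, e.g. networkx.shortest_path(roadmap, start, goal, weight="weight"); swapping dist for a learned metric corresponds to the learned-distance variant compared in the paper.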
Learning state representations enables robotic planning directly from raw observations such as images. Most methods learn state representations using losses based on reconstructing the raw observations from a lower-dimensional latent space. Similarity between observations in image space is then assumed to hold and used as a proxy for estimating similarity between the underlying states of the system. However, observations commonly contain task-irrelevant factors of variation which are nonetheless important for reconstruction, such as varying lighting and different camera viewpoints. In this work, we define relevant evaluation metrics and perform a thorough study of different loss functions for state representation learning. We show that models exploiting task priors, such as Siamese networks with a simple contrastive loss, outperform reconstruction-based representations in visual task planning.
@inproceedings{chamzas2022-contrastive-visual-task-planning,title={Comparing Reconstruction- and Contrastive-based Models for Visual Task Planning},author={Chamzas*, Constantinos and Lippi*, Martina and C. Welle*, Michael and Varava, Anastasia and E. Kavraki, Lydia and Kragic, Danica},booktitle={IEEE/RSJ International Conference on Intelligent Robots and Systems},month=oct,pages={12550-12557},doi={10.1109/IROS47612.2022.9981533},year={2022},url={https://doi.org/10.1109/IROS47612.2022.9981533}}
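To make the contrastive side of this comparison concrete, below is a minimal sketch, in PyTorch, of a Siamese encoder trained with a simple contrastive loss; the architecture, latent size, and margin are illustrative assumptions rather than the paper's exact model.

# Illustrative Siamese encoder with a simple contrastive loss (assumed architecture).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SiameseEncoder(nn.Module):
    def __init__(self, latent_dim=16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, latent_dim),
        )

    def forward(self, x):
        # Both images of a pair are passed through the same (shared-weight) encoder.
        return self.net(x)

def contrastive_loss(z1, z2, same_state, margin=1.0):
    # same_state: float tensor, 1.0 if the two observations show the same task state, else 0.0.
    # Pull same-state embeddings together, push different-state embeddings at least `margin` apart.
    d = F.pairwise_distance(z1, z2)
    positive = same_state * d.pow(2)
    negative = (1.0 - same_state) * F.relu(margin - d).pow(2)
    return (positive + negative).mean()

Image pairs are labeled only by whether they depict the same task state, which is the weak task prior that lets the encoder discard nuisance factors such as lighting and viewpoint.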
Recently, there has been a wealth of development in motion planning for robotic manipulation: new motion planners are continuously proposed, each with its own unique set of strengths and weaknesses. However, evaluating these new planners is challenging, and researchers often create their own ad-hoc problems for benchmarking, which is time-consuming, prone to bias, and does not directly compare against other state-of-the-art planners. We present MotionBenchMaker, an open-source tool to generate benchmarking datasets for realistic robot manipulation problems. MotionBenchMaker is designed to be an extensible, easy-to-use tool that allows users to both generate datasets and benchmark them by comparing motion planning algorithms. Empirically, we show the benefit of using MotionBenchMaker as a tool to procedurally generate datasets, which helps in the fair evaluation of planners. We also present a suite of over 40 prefabricated datasets, with 5 different commonly used robots in 8 environments, to serve as a common ground for future motion planning research.
@article{chamzas2022-motion-bench-maker,title={MotionBenchMaker: A Tool to Generate and Benchmark Motion Planning Datasets},volume={7},number={2},pages={882–889},issn={2377-3766},doi={10.1109/LRA.2021.3133603},journal={IEEE Robotics and Automation Letters},author={Chamzas, Constantinos and Quintero-Pe{\~n}a, Carlos and Kingston, Zachary and Orthey, Andreas and Rakita, Daniel and Gleicher, Michael and Toussaint, Marc and E. Kavraki, Lydia},year={2022},month=apr,url={https://dx.doi.org/10.1109/LRA.2021.3133603}}
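The benchmarking workflow described above amounts to running a set of planners over a shared, procedurally generated problem set and aggregating statistics. The snippet below is a generic, hypothetical sketch of such a loop; planner.solve and the problem objects are stand-ins and deliberately do not mirror MotionBenchMaker's actual API.

# Hypothetical benchmarking loop; `planner.solve(problem, time_limit)` is an assumed interface.
import time
import statistics

def benchmark(planners, problems, time_limit=10.0):
    results = {name: [] for name in planners}
    for problem in problems:
        for name, planner in planners.items():
            start = time.perf_counter()
            path = planner.solve(problem, time_limit)  # assumed to return None on failure
            elapsed = time.perf_counter() - start
            results[name].append((path is not None, elapsed))
    for name, runs in results.items():
        times = [t for solved, t in runs if solved]
        rate = len(times) / len(runs)
        median = statistics.median(times) if times else float("nan")
        print(f"{name}: success rate {rate:.0%}, median solve time {median:.2f} s")
    return results

Running every planner on the exact same generated problems is what removes the ad-hoc, per-paper benchmark bias the abstract points out.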
Representation learning allows planning actions directly from raw observations. Variational Autoencoders (VAEs) and their modifications are often used to learn latent state representations from high-dimensional observations such as images of the scene. This approach uses the similarity between observations in the space of images as a proxy for estimating similarity between the underlying states of the system. We argue that, despite some successful implementations, this approach is not applicable in the general case where observations contain task-irrelevant factors of variation. We compare different methods to learn latent representations for a box stacking task and show that models with weak supervision such as Siamese networks with a simple contrastive loss produce more useful representations than traditionally used autoencoders for the final downstream manipulation task.
@misc{chamzas2020rep-learning,author={Chamzas*, Constantinos and Lippi*, Martina and C. Welle*, Michael and Varava, Anastasiia and Alessandro, Marino and E. Kavraki, Lydia and Kragic, Danica},booktitle={NeurIPS, 3rd Robot Learning Workshop: Grounding Machine Learning Development in the Real World},title={State Representations in Robotics: Identifying Relevant Factors of Variation using Weak Supervision},year={2020},month=dec,url={https://www.robot-learning.ml/2020/}}
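For contrast with the Siamese model sketched earlier, here is a minimal sketch of the reconstruction-based baseline discussed in this work: a standard VAE whose loss rewards reproducing every pixel, so the latent must also encode task-irrelevant factors; the layer sizes and image resolution are assumptions for illustration.

# Minimal VAE baseline (assumed sizes); inputs are images normalized to [0, 1].
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    def __init__(self, latent_dim=16, img_dim=3 * 64 * 64):
        super().__init__()
        self.enc = nn.Sequential(nn.Flatten(), nn.Linear(img_dim, 256), nn.ReLU())
        self.mu = nn.Linear(256, latent_dim)
        self.logvar = nn.Linear(256, latent_dim)
        self.dec = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                                 nn.Linear(256, img_dim), nn.Sigmoid())

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization trick
        return self.dec(z), mu, logvar

def vae_loss(recon, x, mu, logvar):
    # The reconstruction term forces the latent to encode all factors of variation,
    # including task-irrelevant ones such as lighting and viewpoint.
    rec = F.binary_cross_entropy(recon, x.flatten(1), reduction="sum")
    kld = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return rec + kld

In the box-stacking comparison above, it is exactly this pressure to reconstruct nuisance detail that makes the resulting latent state less useful for the downstream manipulation task than the weakly supervised Siamese alternative.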